Asynchronous Q-learning
A Further related works
We now take a moment to discuss a small sample of other related works. The asymptotic convergence of classical Q-learning has long been established (Tsitsiklis, 1994; Jaakkola et al., 1994; Szepesvári, 1997); being model-free, the algorithm enjoys a low space complexity. Finite-time guarantees of other variants of Q-learning have also been developed; partial examples include speedy Q-learning (Azar et al., 2011) and double Q-learning. A common theme is to augment the original model-free update rule (e.g., the Q-learning update rule) with an exploration bonus, which typically takes the form of, say, certain upper confidence bounds (UCBs) motivated by the bandit literature (Lai and Robbins, 1985; Auer and Ortner, 2010). Model-based RL is known to be minimax-optimal in the presence of a simulator (Azar et al., 2013; Agarwal et al., 2020; Li et al., 2020a), beating the state-of-the-art model-free algorithms by achieving optimality for the entire sample size range (Li et al., 2020a). When it comes to online episodic RL, Azar et al. (2017) was the first work that managed to achieve a near minimax-optimal regret bound. The way of constructing hard MDPs in Jaksch et al. (2010) has since been adapted by Jin et al. (2018) to exhibit a lower bound for episodic MDPs (with a sketched proof provided therein).
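To make the "exploration bonus" theme above concrete, the sketch below shows a single Q-learning update augmented with a UCB-style bonus. The function name, the $1/n$ step size, and the exact form of the bonus are illustrative assumptions, not the update rule of any specific paper cited above.

```python
import numpy as np

def ucb_q_update(Q, counts, s, a, r, s_next, gamma=0.9, c_bonus=1.0):
    """One tabular Q-learning update with an illustrative UCB-style
    exploration bonus added to the reward. Q has shape (nS, nA);
    counts tracks per-(s, a) visit counts."""
    counts[s, a] += 1
    n = counts[s, a]
    lr = 1.0 / n                                  # simple 1/n step size
    bonus = c_bonus * np.sqrt(np.log(n + 1) / n)  # optimism term, shrinks with visits
    target = r + bonus + gamma * Q[s_next].max()
    Q[s, a] += lr * (target - Q[s, a])
    return Q
```

The bonus shrinks as a state-action pair is visited more often, so optimism is concentrated on under-explored entries.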
Review for NeurIPS paper: Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction
Weaknesses: My main concern about the paper is whether the proposed algorithm is actually implementable, due to the specific expression of the (constant) learning rate. I have two concerns: 1. The learning rate depends on t_{mix} in Theorem 1 and on the universal constant c_1 in both Theorem 1 and Theorem 2. How can we compute/approximate t_{mix} in advance? If we cannot, is it sufficient to employ a lower bound on t_{mix}? Looking at the proofs, c_1 is a function of the constant c (Equation 55), which in turn derives from Bernstein's inequality (Equation 81) and subsequently \tilde{c} (Equation 84), but its value is never explicitly computed. I am aware that in [33] as well, the learning rate schedule (which is not constant) depends on \mu_{min} and t_{mix}, but I think the authors should elaborate more on this and explain how to deal with it in practice, if possible.
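The reviewer's question about approximating t_{mix} can be illustrated with a simple heuristic: when the transition matrix and stationary distribution are known, the mixing time can be bounded by iterating the chain until the worst-case total-variation distance to stationarity falls below a threshold. This sketch assumes a known chain, which sidesteps the harder question of estimating t_{mix} from data alone; it is an illustrative aid, not a method from the paper under review.

```python
import numpy as np

def estimate_mixing_time(P, pi, tol=0.25, horizon=10_000):
    """Smallest t with max_s TV(P^t[s, :], pi) <= tol, for a known
    transition matrix P (shape (n, n)) and stationary distribution pi.
    tol = 1/4 is the conventional threshold in the mixing-time definition."""
    n = P.shape[0]
    Pt = np.eye(n)
    for t in range(1, horizon + 1):
        Pt = Pt @ P                                   # t-step transition matrix
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()  # worst-case TV distance
        if tv <= tol:
            return t
    raise RuntimeError("chain did not mix within the horizon")
```

In practice one would plug a conservative (large) surrogate for t_{mix} into the learning rate, trading extra samples for validity of the theoretical guarantee.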
Review for NeurIPS paper: Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction
The reviewers appreciated the efforts made by the authors in the rebuttal, and updated their reviews accordingly. The paper contributions are now clear and important (an improved sample complexity analysis of asynchronous Q-learning, and a novel variance reduction algorithm and its analysis). We recommend the paper for acceptance and encourage the authors to account for the reviewers' comments when preparing the camera-ready version of the paper.
Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a \gamma -discounted MDP with state space S and action space A, we demonstrate that the \ell_{\infty} -based sample complexity of classical asynchronous Q-learning --- namely, the number of samples needed to yield an entrywise \epsilon -accurate estimate of the Q-function --- is at most on the order of \frac{1}{ \mu_{\min}(1-\gamma)^5 \epsilon^2 } + \frac{ t_{\mathsf{mix}} }{ \mu_{\min}(1-\gamma) } up to some logarithmic factor, provided that a proper constant learning rate is adopted. The first term of this bound matches the complexity in the case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the expense taken for the empirical distribution of the Markovian trajectory to reach a steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, the above bound improves upon the state-of-the-art result by a factor of at least |S||A|.
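The algorithm analyzed in this abstract can be sketched in a few lines: along a single Markovian trajectory, only the visited (s, a) entry is updated at each step, with a constant learning rate. The function signature and the callables `env_step(s, a) -> (r, s_next)` and `behavior(s) -> a` are assumptions for illustration, not an interface from the paper.

```python
import numpy as np

def async_q_learning(env_step, s0, num_steps, gamma, eta, nS, nA, behavior):
    """Minimal sketch of classical asynchronous Q-learning with a
    constant learning rate eta on one trajectory of length num_steps."""
    Q = np.zeros((nS, nA))
    s = s0
    for _ in range(num_steps):
        a = behavior(s)
        r, s_next = env_step(s, a)
        td_target = r + gamma * Q[s_next].max()   # one-sample Bellman target
        Q[s, a] += eta * (td_target - Q[s, a])    # update only the visited entry
        s = s_next
    return Q
```

The contrast with synchronous Q-learning is that here the update order is dictated by the behavior policy and the chain's dynamics, which is exactly why \mu_{\min} and t_{\mathsf{mix}} enter the bound.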
Unified ODE Analysis of Smooth Q-Learning Algorithms
Convergence of Q-learning has been the focus of extensive research over the past several decades. Recently, an asymptotic convergence analysis for Q-learning was introduced using a switching system framework. This approach applies the so-called ordinary differential equation (ODE) approach to prove the convergence of asynchronous Q-learning modeled as a continuous-time switching system, where notions from switching system theory are used to prove its asymptotic stability without explicit Lyapunov arguments. However, to prove stability, restrictive conditions, such as quasi-monotonicity, must be satisfied for the underlying switching systems, which makes it difficult to generalize the analysis method to other reinforcement learning algorithms, such as the smooth Q-learning variants. In this paper, we present a more general and unified convergence analysis that improves upon the switching system approach and can analyze Q-learning and its smooth variants. The proposed analysis is motivated by previous work on the convergence of synchronous Q-learning based on a $p$-norm serving as a Lyapunov function. However, the proposed analysis addresses more general ODE models that can cover both asynchronous Q-learning and its smooth versions with simpler frameworks.
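One common "smooth" variant of the kind this analysis covers replaces the hard max over next-state values with a log-sum-exp (softmax) operator, which recovers the max as the temperature goes to zero. The sketch below is illustrative; the specific operator and names are assumptions, not taken from the paper above.

```python
import numpy as np

def smooth_q_update(Q, s, a, r, s_next, gamma=0.9, eta=0.1, tau=0.1):
    """One tabular Q-learning update where the greedy max is replaced
    by a numerically stable log-sum-exp with temperature tau."""
    m = Q[s_next].max()  # subtract the max for numerical stability
    v_next = m + tau * np.log(np.exp((Q[s_next] - m) / tau).sum())
    Q[s, a] += eta * (r + gamma * v_next - Q[s, a])
    return Q
```

Because log-sum-exp is smooth in Q, the associated ODE has a differentiable right-hand side, which is one reason such variants fall outside the quasi-monotone switching-system conditions but fit a unified ODE treatment.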